FOREWORD


One  of  the  goals  of  Air  Force  Electronic  Systems  Division 
is  the  development  of  a  technology  for  computer-based,  personnel- 
support  systems  integrated  into  Air  Force  Information  Systems. 
These  support  systems  are  required  to  improve  the  efficiency 
of  man-computer  interactions  in  the  host  Information  Systems. 

They are designed to provide automated on-the-job training and
performance- and decision-aiding for Information Systems personnel.

Task 280104, Computer-Aided Instruction Techniques, under
Project 2801, Design Methodology for Military Information Systems,
was established to develop tools and techniques for
computer-aided training and performance- and decision-aiding in
these systems. It is also concerned with new software engineering
techniques which will permit cost-effective implementation
of these aids. This study relates to the latter objective.

This  report,  one  in  a  series  supporting  Project  2801, 
addresses  the  problem  of  reducing  the  size  of  text  files  which 
constitute  the  bulk  of  the  lesson  files  in  the  typical  computer- 
aided  instruction  (CAI)  systems.  The  approach  is  to  simulate 
a  practical  text  compression  algorithm  and  test  it  against 
CAI  lesson  material.  While  the  orientation  of  this  study  is 
toward  CAI,  the  technique  is  generally  applicable  to  reducing 
the  size  of  text  files  in  other  systems  such  as  data  management, 
command  and  control,  and  intelligence  data  bases. 

This  study  was  performed  by  Captain  J.  M.  Knight,  Jr. 
as  part  of  his  reserve  training  day  duties  between  May  1970 
and  September  1971,  including  two  2-week  active  duty  tours. 

Dr. Sylvia R. Mayer, ESD/MCIT, suggested the study and served
as Air Force Task Scientist.

This  Technical  Report  has  been  reviewed  and  is  approved. 


SYLVIA  R.  MAYER,  Ph.D. 
Project  Scientist 


MELVIN  B.  EMMONS,  Colonel,  USAF 
Director,  Information  Systems  Technology 
Deputy for Command & Management Systems


ABSTRACT


This report describes the initial evaluation of a text compression
algorithm against Computer-Aided Instruction (CAI) material. A
review of some concepts related to statistical text compression is
followed by a detailed description of a practical text compression
algorithm. A simulation of the algorithm was programmed and used
to obtain compression ratios for a small sample of both traditional
frame-structured CAI material and a new type of information-structured
CAI material. The resulting compression ratios are near 1.5 to one
for both types of material. The simulation program was modified to
apply the algorithm to the lesson files of a particular frame-structured
CAI subsystem used in the Air Force Phase II Base Level System. The
compression in this case was found to be 1.3 to one because of the
presence in the lesson file of incompressible frame-formatting bytes.
The modified simulation program was also used to take letter occurrence
statistics on the text being compressed. From these, a theoretical
compression was calculated using a probabilistic model of the
compression algorithm. Theoretical compression was within two per cent
of measured compression, thus verifying the model's applicability.
The report closes with some open questions and a discussion
of future work.




TABLE OF CONTENTS

I.    INTRODUCTION
II.   CONCEPTS IN STATISTICAL TEXT DATA COMPRESSION
III.  SOME TEXT COMPRESSION RESULTS
IV.   THE SNYDERMAN-HUNT COMPRESSION ALGORITHM
V.    EXPERIMENTS
VI.   CONCLUSIONS, QUESTIONS AND RECOMMENDATIONS

APPENDIX A: ANALYSIS OF THE SNYDERMAN-HUNT TEXT COMPRESSION ALGORITHM
APPENDIX B: TXTCMP PROGRAM LISTING
APPENDIX C: EXPERIMENTAL MATERIAL

REFERENCES




SECTION  I 


INTRODUCTION 

Presently, lesson material for Computer-Aided Instruction
(CAI) occupies considerable disk space when the CAI system is
brought on-line. For example, in the Computer-Directed Training
(CODIT) subsystem of the Air Force Phase II Base Level System,
each 300-frame lesson is stated to occupy 121,600 bytes. Even
the short "Computer Operator's Course" contains the equivalent
of 14 lessons; other courses, such as the personnel course, contain
many more. Accordingly, the technological area of text
compression is being reviewed for practical methods whereby CAI
data bases may be reduced in size with only moderate computational
expense.

Section II presents an elementary discussion of statistical
text compression and some indication of its performance on English
text. However, there also exists a simpler compression algorithm
based on the practical fact that, although data characters are
stored in 8-bit bytes, only about one third of the potential
256 characters are actually used in current ADP systems; the
remaining two-thirds can be used to encode frequently
occurring character pairs as single, otherwise unused characters, thus
obtaining data compression.

This  report  describes  in  more  detail  a  simple,  practical 
compression  algorithm,  its  application  to  a  small  set  of  CAI 
data  base  material,  and  the  results.  Performance  of  the  algorithm 
is  modeled  and  the  model  is  experimentally  verified.  In  addition, 
a short discussion in Section VI provides guidance for future work.


SECTION II


CONCEPTS  IN  STATISTICAL  TEXT  DATA  COMPRESSION 

A.  Bits 

Data is stored in bits or in groups of bits, called bytes.
One "bit" of information represents the outcome of a single yes or
no decision. One bit can also represent a binary state of a given
situation. An ordinary room light switch can store one bit of
information, e.g., "on" might mean "at home" and "off" might mean
"not at home."

Groups of bits can represent more information. Two switches
can represent two sequential binary decisions, i.e., four outcomes,
or situation states, such as "on, on"; "on, off"; "off, on"; and
"off, off". Three switches can represent eight states and, in general,
N switches can represent $2^N$ states. A byte consisting of eight bits
can represent 256 characters such as A, B, C, ..., 1, 2, 3, ..., ?, $,
etc. Data is generally stored one character to a byte. Nine-channel
magnetic data-processing tape can store 800 bytes per lineal inch of
tape because the eight bits of the byte are laterally distributed across
the tape, along with a ninth bit, called a parity check bit.

B.  Entropy 

Entropy is a property of the units, such as characters or symbols,
which make up data. Entropy is a measure of the "surprisal", or information
value, of a symbol. It has the units of bits/symbol and a common designation
of H. A few simple examples will perhaps clarify the intuitive notion
of entropy.

For instance, if it is equally likely that John is going to the
seashore or the mountains this summer, and we hear that he is going to
the mountains, we are moderately informed, or shall we say, surprised.
In this situation, the symbols "mountain" and "seashore" have for us
equal information value. They are said to have equal entropy. If, on
the other hand, John historically goes to the mountains nineteen summers
out of twenty and we hear he is going to the mountains, we are not terribly
surprised or informed. The symbol "mountain", in this case, possesses
a low entropy, information value, or surprisal content. If we hear that
John is going to the seashore we are quite surprised and highly informed
of the happening of a low probability event. The symbol "seashore" has
a high entropy, information value, or surprisal content. The entropy
of a symbol is related to the a priori probability of occurrence of that symbol.

The mathematical measure of entropy of the ith
symbol in a data set is given by

$$H_i = -\log_2 p_i \quad \text{(bits/symbol)} \tag{1}$$

where $p_i$ is the a priori probability of occurrence of the ith
symbol in a data set. A symbol occurring 1/2 the time ($p_i = 0.5$)
has an entropy of one bit/symbol. One occurring 1/4 of the time
($p_i = 0.25$) has an entropy of two bits/symbol. One occurring 1/8 of the
time has three bits/symbol and, in general, one occurring $1/2^k$ of the time has
k bits/symbol entropy. Also, k may be fractional as well as
integer, depending on $p_i$.
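
As a quick check of equation (1), the following Python sketch computes the entropy (surprisal) of a single symbol from its a priori probability. The function name `symbol_entropy` is an illustrative choice and does not come from the report.

```python
import math


def symbol_entropy(p: float) -> float:
    """Entropy (surprisal) in bits of a symbol with a priori probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in the interval (0, 1]")
    return -math.log2(p)


# Probabilities 1/2, 1/4 and 1/8 give whole numbers of bits;
# 0.4 gives a fractional value.
for p in (0.5, 0.25, 0.125, 0.4):
    print(f"p = {p}: {symbol_entropy(p):.2f} bits/symbol")
```

The less probable the symbol, the larger its entropy, which matches the seashore and mountain example.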

C.  String  Data 

Much data is transmitted and stored in the form of
strings, i.e., connected sets of alphanumeric characters, or
other symbols, issuing from an information "source" and bound
ultimately for an information "sink". Consider a source capable
of generating four characters: A, B, C, and D, each occurring
1/4 of the time, i.e., $p_A = p_B = p_C = p_D = 0.25$. The entropy
of all characters is the same and therefore the average entropy
of the source is also two bits/symbol. Each character A, B,
C, and D may be represented in transmission by two on-off
(binary) pulses and in storage by two binarily magnetized
patches on a computer tape or disk unit. But now consider a
source which exhibits an unequal distribution of A, B, C, and
D symbols, e.g., $p_A = 0.4$, $p_B = 0.3$, $p_C = 0.2$ and $p_D = 0.1$.
Using equation (1) the entropies are calculated as $H_A = 1.32$,
$H_B = 1.74$, $H_C = 2.32$ and $H_D = 3.32$, all in bits/symbol. The
average (or expected value of) the source entropy $H_S$ is given
by

$$H_S = 0.4 H_A + 0.3 H_B + 0.2 H_C + 0.1 H_D \tag{2}$$

The value of $H_S$ is 1.846 bits/symbol. Note that this value is
less than the 2 bits/symbol average source entropy of the
"equally likely" source. A still more uneven occurrence distribution
than that given above would result in a smaller source
entropy.
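
The average in equation (2) can be reproduced with a few lines of Python. This is a minimal sketch; the name `source_entropy` is an assumption made for illustration.

```python
import math


def source_entropy(probabilities):
    """Average entropy in bits/symbol: each symbol's -log2(p) weighted by p."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)


equal = [0.25, 0.25, 0.25, 0.25]   # A, B, C, D equally likely
unequal = [0.4, 0.3, 0.2, 0.1]     # the unequal distribution from the text

print(f"equally likely source: {source_entropy(equal):.3f} bits/symbol")
print(f"unequal source:        {source_entropy(unequal):.3f} bits/symbol")
```

Trying a still more lopsided distribution, such as 0.7, 0.1, 0.1, 0.1, shows the source entropy dropping further.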


Although it is not obvious, the above source entropy
value does lead us to suspect that we can find a code, i.e.,
a mapping, between A, B, C, D and four groups of one or more
bits each such that the average number of bits per code group
is not only close to the source entropy but also less than
a straight two bits per character. This is indeed the case and
the code is as follows:

A = 1 (one bit/symbol)
B = 01 (two bits/symbol)
C = 001 (three bits/symbol)
D = 000 (three bits/symbol)

The coded source sequence

11011001011000 ...

is uniquely decodable as the original source sequence

AABACBAD ...

Considering the probability of occurrence of A, B, C,
and D we obtain an average code length of 1.9 bits/symbol.
This represents a slight compression of 1.052 relative
to the original two bits/symbol. A more unequal character
occurrence distribution would result in a higher compression
ratio. Thus, we see that data compression can involve measuring
the statistics of source symbol occurrence, designing an
efficient code, and designing both an encoder and a decoder,
implementable in either hardware or software.
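
The four-symbol code above is small enough to exercise directly. The sketch below is an illustration in Python, not the report's software: it encodes and decodes with the code words A = 1, B = 01, C = 001, D = 000 and computes the average code length from the stated probabilities. The function names are assumptions.

```python
CODE = {"A": "1", "B": "01", "C": "001", "D": "000"}
PROBABILITY = {"A": 0.4, "B": 0.3, "C": 0.2, "D": 0.1}


def encode(symbols: str) -> str:
    """Concatenate the code word of each source symbol."""
    return "".join(CODE[s] for s in symbols)


def decode(bits: str) -> str:
    """Read bits left to right, emitting a symbol whenever a code word completes."""
    reverse = {word: symbol for symbol, word in CODE.items()}
    decoded, current = [], ""
    for bit in bits:
        current += bit
        if current in reverse:
            decoded.append(reverse[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a complete code word")
    return "".join(decoded)


# Expected code length in bits/symbol, weighted by symbol probability.
average_length = sum(PROBABILITY[s] * len(CODE[s]) for s in CODE)

print(encode("AABACBAD"))
print(decode(encode("AABACBAD")))
print(f"average length = {average_length:.2f} bits/symbol")
```

Because no code word is a prefix of another, the decoder never needs to look ahead, which is what makes the bit stream uniquely decodable.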

D.  Block  Source  Encoding 

Another form of encoding is to group symbols of a
source string into blocks. Consider an example where the
string consists of specification of right or left-handedness
and that, for our sample, right-handers outnumber left-handers
by 19 to one. The probability of a right-hander, $P_R$, is 0.95
and the probability of a left-hander, $P_L$, is 0.05. Simple symbol
encoding of "0" for R and "1" for L yields an average code word
length of one bit/symbol. But the entropy of a binary source
with .95 and .05 probabilities is only 0.286 bits/symbol. This
suggests that we can do better than merely encode R and L into
"0" and "1". However, it also suggests that the very best we
can do is to obtain a compression of about 3.5.

Now consider coding blocks of, let us say, three symbols.
We now get a new data source by grouping the old data source
into blocks of three. The new data source emits eight different
symbols 1, 2, ..., 8, each representing a possible combination
of three of the symbols from the old data source. The probabilities
of symbol occurrence for the new data source are derivable from
the probabilities of symbol occurrence from the old data source.
Assuming the occurrence of any symbol is independent of the
occurrence of the previous symbol we obtain, for example, the
probability of symbol 3 ("RLR" = "010") from the product of
probabilities

$$P_R \cdot P_L \cdot P_R = (0.95)(0.05)(0.95) = 0.04513.$$

The results are summarized in Table 1.

Table 1

TABLE OF BLOCK ENCODING ENTROPIES

Symbols from      Symbol from       Probability of       H of new
old data source   new data source   new source symbol    source symbol

000               1                 .85738                 .22199
001               2                 .04513                4.46977
010               3                 .04513                4.46977
011               4                 .00237                8.72090
100               5                 .04513                4.46977
101               6                 .00237                8.72090
110               7                 .00237                8.72090
111               8                 .00012               13.02468

H = 0.85906 for the new source.
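
The block probabilities and entropies of Table 1 can be regenerated from the two source probabilities alone. The following Python sketch assumes, as the text does, that successive symbols are independent; the variable names are illustrative.

```python
import math
from itertools import product

P = {"R": 0.95, "L": 0.05}   # right- and left-hander probabilities from the text
BLOCK = 3                    # symbols per block


def entropy(probabilities):
    """Average entropy in bits per symbol of the given distribution."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)


# Probability of each three-symbol block under the independence assumption.
block_probability = {
    "".join(block): math.prod(P[s] for s in block)
    for block in product("RL", repeat=BLOCK)
}

h_symbol = entropy(P.values())                 # bits per old source symbol
h_block = entropy(block_probability.values())  # bits per new (block) symbol
best_compression = 1.0 / h_symbol              # relative to one bit per R/L symbol

for block, p in block_probability.items():
    print(f"{block}  p = {p:.5f}  H = {-math.log2(p):.5f}")
print(f"H per old symbol = {h_symbol:.4f}, H per block = {h_block:.4f}")
print(f"upper bound on compression = {best_compression:.2f}")
```

With independent symbols the block entropy is exactly three times the single-symbol entropy; small differences from the printed table come from rounding the probabilities to five places.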


The  entropy  of  the  new  source  is  0.85906  bits/new  source 
symbol.  Notice  that,  on  the  basis  of  three  old  source  symbols 
to  one  new  source  symbol,  the  entropy  is  also  .2863  bits/old 
source  symbol.  However,  now  we  have,  with  eight  symbols  instead 
of  only  two,  more  freedom  to  design  an  efficient  code.  There 
exists  a  technique  which  allows  the  construction  of  a  code  whose 
coded entropy is within one bit/symbol of the entropy of the
original  source.  For  a  block  length  of  one  the  code  is  simply 
one  bit  in  length  for  each  source  symbol  "0"  and  "1",  hence, 
the  coded  source  entropy  is  one  bit/symbol  which  is  within  one 
bit/symbol  of  the  source  entropy,  which  must  lie  between  zero 
and  1.00.  Table  2  gives  the  efficient  code. 

TABLE 2

Table of Efficient Code for Block Length = 3

Source                        Probability of   "Expected value"
Symbol    Code       Length   Occurrence       of Symbol Length

1         1          1        .85738           .85738
2         00         2        .04513           .09026
3         010        3        .04513           .13539
4         0111       4        .04513           .18052
5         01100      5        .00237           .01185
6         011010     6        .00237           .01422
7         0110111    7        .00237           .01659
8         0110110    7        .00012           .00096
                                               -------
                                               1.10717

Average symbol length = 1.10717 bits/new source symbol.

From the above table we see that the average code word
length is now 1.10717 bits/new symbol and this quantity represents
three old symbols, such as RLR. This code yields a compression
of 2.71 to one compared with a maximum possible compression of
3.5 to one. The use of longer blocks, and more complex codes,
will result in a closer approach to the maximum possible compression
figure. In this example we have assumed independence
of symbol occurrence. Should there be any symbol occurrence
dependence, resulting in lower entropy, block encoding will pick
up this advantage also. Thus, we see that data compression not
only involves measuring original source occurrence probabilities
and devising efficient codes but also blocking the original
source sequence into reasonable lengths, treating these as a
new source, and then devising an efficient code based on the
probabilities of the new source.
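
The arithmetic behind Table 2 is easy to re-check by machine. The Python sketch below takes the code words and probabilities exactly as listed in the table, confirms that the code is prefix-free and complete, and recomputes the expected code length. Summing length times probability over the listed rows gives about 1.307 bits per block, or roughly 2.3 to one, rather than the printed total of 1.10717, so the printed total and the 2.71 figure derived from it may deserve a second look against the original.

```python
from fractions import Fraction

# (code word, probability of occurrence) as listed in Table 2
TABLE_2 = [
    ("1", 0.85738),
    ("00", 0.04513),
    ("010", 0.04513),
    ("0111", 0.04513),
    ("01100", 0.00237),
    ("011010", 0.00237),
    ("0110111", 0.00237),
    ("0110110", 0.00012),
]
BLOCK_LENGTH = 3  # old source symbols represented by one code word

words = [word for word, _ in TABLE_2]

# A code is instantaneously decodable when no word is a prefix of another.
is_prefix_free = not any(
    a != b and b.startswith(a) for a in words for b in words
)

# Kraft sum: equals exactly 1 for a complete prefix code.
kraft_sum = sum(Fraction(1, 2 ** len(word)) for word in words)

# Expected code word length in bits per block, and the resulting compression
# relative to one bit per old source symbol.
average_length = sum(len(word) * p for word, p in TABLE_2)
compression = BLOCK_LENGTH / average_length

print(f"prefix-free: {is_prefix_free}, Kraft sum: {kraft_sum}")
print(f"average length = {average_length:.5f} bits/block")
print(f"compression    = {compression:.2f} to one")
```

Either way, the qualitative point of the section stands: blocking three symbols brings the achieved compression much closer to the 3.5 bound than the one-bit-per-symbol code does.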

SECTION III


SOME TEXT COMPRESSION RESULTS

Shannon (3) gives us an estimate of the entropy of
English text as a function of how many previous letters are
allowed to be known. An upper bound on compression can be calculated
by dividing this entropy into the entropy of a source
which puts out all letters randomly with equal probability.
Table 3 gives entropies and compressions.

Table 3

Entropies and Compressions of an English Text Source Under
Various Constraints

Constraint                        Entropy (bits/letter)   Compression

None, 26 letters and one
space equiprobable                4.76                    1
Letter and space frequencies      4.03                    1.18
One letter known                  3.32                    1.43
Two letters known                 3.1                     1.53
Word frequencies used             2.14                    2.22
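
The compression column of Table 3 follows from dividing the equiprobable entropy by each constrained entropy. A short Python sketch of that division, with dictionary keys chosen for illustration:

```python
import math

# 26 letters plus the space, all equally likely.
BASELINE = math.log2(27)

# Entropies in bits/letter as listed in Table 3.
ENTROPY = {
    "letter and space frequencies": 4.03,
    "one letter known": 3.32,
    "two letters known": 3.1,
    "word frequencies used": 2.14,
}

# Upper bound on compression under each constraint.
compression = {name: BASELINE / h for name, h in ENTROPY.items()}

print(f"baseline entropy = {BASELINE:.2f} bits/letter")
for name, ratio in compression.items():
    print(f"{name:30s} {ratio:.2f}")
```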

Shannon continued his investigation of English entropy
beyond the point where "N-grams" of English were known. An N-gram
is a histogram giving the relative frequencies of combinations
of N letters. By having people predict the next letter
when shown the previous L letters, Shannon was able to estimate
entropies of English for constraint lengths close to 100 letters.
For $10 \le L \le 15$ the entropy was about 1.5 bits/letter (compression
= 3.17) and for L = 100 it was .95 bits/letter (compression = 5).

Unfortunately, compressors using constraint lengths of
100 (about 20 words or so) appear completely beyond the state of
the art. However, single-word dictionary-type compressors do
appear feasible. A simulated word dictionary compression
algorithm is discussed by White, showing results of compressions
between 1.4 and 1.7 to one with a "small" dictionary and two
to one with a 1000-word dictionary. For a restricted vocabulary
situation, as elementary training and drill CAI may produce, we
probably can take two to one as a working value for statistical,
word text compression. This figure compares favorably with
Shannon's figure of 2.22 for word frequency compression.

Consider now the algorithm which is the object of this
report. Snyderman and Hunt (5) report on a practical text
compression algorithm, used at the Science Information Exchange,
Smithsonian Institution, to compress the text portion
of a 200,000 record on-line file from an average of 851 to
553 characters per record. This represents an implemented
compression of 1.54 relative to eight bits/character, a very
respectable figure.

The average of 8/1.54 = 5.2 bits/character represents
a net compression of 1.245 to one relative to 6.46 bits/character
for 88 equal frequency characters. This net compression
lies between Shannon's theoretical compression (1.18)
for an English text source when letter-space frequencies are
known and the compression (1.43) when one previous letter is
known. In summary, the literature indicates character text
compression at around 1.5 to one and word text compression at
around 2 to one.
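
The figures quoted for the Snyderman-Hunt installation can be retraced from the record sizes. The Python sketch below is only a worked calculation; the variable names are illustrative, and small differences from the printed values come from rounding at intermediate steps.

```python
import math

chars_before, chars_after = 851, 553   # average characters per record
alphabet_size = 88                     # characters actually in use

# Implemented compression relative to 8 bits per stored character.
compression = chars_before / chars_after

# Effective bits spent per original character after compression.
bits_per_char = 8 / compression

# Bits needed per character if all 88 characters were equally frequent.
equal_frequency_bits = math.log2(alphabet_size)

# Net compression relative to that equal-frequency baseline.
net_compression = equal_frequency_bits / bits_per_char

print(f"implemented compression = {compression:.2f}")
print(f"bits per character      = {bits_per_char:.2f}")
print(f"equal-frequency bits    = {equal_frequency_bits:.2f}")
print(f"net compression         = {net_compression:.3f}")
```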

SECTION IV


THE  SNYDERMAN-HUNT  COMPRESSION  ALGORITHM 


This section discusses more formally the Snyderman-Hunt
algorithm. The algorithm was chosen to evaluate compressibility
of CAI material because of its practicality, its demonstrated
performance on English text and its speed. The speed
of this algorithm on a 360/40 is on the order of 65-75 milliseconds
per thousand characters, compressing or decompressing.
It operates on the following principles.

Characters are normally stored one per 8-bit byte. With
eight bits, one of $2^8 = 256$ characters can be specified by each
byte. At the Science Information Exchange only 88 characters
are used: 52 upper and lower case alphabetics, 10 numerics and
26 special characters such as comma, period, dollar sign, etc.
This leaves 256 - 88 = 168 "unused" characters. These otherwise
unused 8-bit combinations can be utilized to represent the more
commonly occurring pairs of characters in the 88 used character
set, thus effecting a compression.

More specifically, it is convenient to define four sets:

$$
\begin{aligned}
T &= \{\text{all 256 possible characters}\} \\
C &= \{\text{actual characters used}\} \\
CC &= \{\text{combining characters}\} \\
MC &= \{\text{master characters}\}
\end{aligned}
\tag{3}
$$

These sets are related as follows:

$$MC \subset CC \subset C \subset T \tag{4}$$

A further set CP, for "combined pairs", can be formed
of all ordered pairs of MC and CC, i.e.,

$$CP = MC \times CC . \tag{5}$$

The members of CP can be placed in one-to-one correspondence
with the difference set D defined as

$$D = T - C . \tag{6}$$

The set of noncombining characters NC is given as

$$NC = C - CC . \tag{7}$$

For example, Snyderman and Hunt choose:

$$MC = \{\text{space}, A, E, I, O, N, T, U\} \tag{8}$$

$$CC = \{\text{space}, A \text{ through } I, L \text{ through } P, R \text{ through } W\} \tag{9}$$

The set MC has 8 members; CC has 21. The set of all combined
pairs CP has 8 x 21 = 168 members, which are one-to-one related
to the 168 members of difference set D.
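
The set relationships in equations (3) through (9) can be checked mechanically. The following Python sketch builds MC and CC as chosen by Snyderman and Hunt, forms the combined pairs, and confirms the counts. The numbering of pairs from 0 to 167 is a hypothetical stand-in for whatever unused byte values a real system would assign; the report does not specify that assignment.

```python
T_SIZE = 256   # all possible 8-bit characters
C_SIZE = 88    # characters actually used: 52 letters, 10 digits, 26 specials

# Master and combining characters, equations (8) and (9).
MC = set(" AEIONTU")
CC = set(" ") | set("ABCDEFGHI") | set("LMNOP") | set("RSTUVW")

# Combined pairs: every ordered (master, combining) pair, equation (5).
CP = [(m, c) for m in sorted(MC) for c in sorted(CC)]

# Size of the difference set D = T - C, equation (6).
D_SIZE = T_SIZE - C_SIZE

# Hypothetical one-to-one assignment of each pair to an index 0..167,
# standing in for the 168 otherwise unused byte values.
PAIR_INDEX = {pair: index for index, pair in enumerate(CP)}

print(f"|MC| = {len(MC)}, |CC| = {len(CC)}, |CP| = {len(CP)}, |D| = {D_SIZE}")
print(f"MC is a subset of CC: {MC <= CC}")
```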

The algorithm works by examining a character in a
string. If the character is a member of MC the next character
is examined. If the next character is a member of CC then the
two-character combined pair is coded into a single unused
character and stored. If the first character is not a member
of MC, it is stored as is. If the first character is a member
of MC but the second one is not a member of CC then the two
characters are stored individually, as is. Thus we see that
compression is dependent upon both the probability of finding
a master character and the conditional probability of finding
a combining character given the finding of a master character.
An analysis of the algorithm is presented in Appendix A.
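
The scanning rule just described can be sketched as a character counter. The Python code below is an illustration under stated assumptions, not the authors' program: it treats the text as upper case, uses the master and combining sets of equations (8) and (9), and only counts stored characters rather than producing actual code bytes. The function names are hypothetical.

```python
MC = set(" AEIONTU")                # master characters
CC = set(" ABCDEFGHILMNOPRSTUVW")   # combining characters


def compressed_length(text: str) -> int:
    """Count the characters the algorithm would store for the given text."""
    stored = 0
    i = 0
    while i < len(text):
        if text[i] in MC and i + 1 < len(text) and text[i + 1] in CC:
            # Master followed by a combining character: the pair is
            # stored as one otherwise unused character.
            i += 2
        else:
            # Stored as is. A character that fails the CC test cannot be
            # a master either (MC is a subset of CC), so advancing by one
            # is equivalent to storing both characters individually.
            i += 1
        stored += 1
    return stored


def compression_ratio(text: str) -> float:
    """Characters input divided by characters stored."""
    if not text:
        return 1.0
    return len(text) / compressed_length(text)


print(compressed_length("THE CAT"), compression_ratio("THE CAT"))
```

For "THE CAT" the pairs TH, E-space and AT each collapse to one stored character and C is stored alone, so seven input characters become four.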

SECTION V


EXPERIMENTS 


A.  Experiment  One 
1.  Description 

A computer program was written to simulate the Snyderman-Hunt
algorithm. The simulation did not actually code the
characters, but rather "kept score" on the number of characters
that the algorithm would output for each line of input text.
Compression ratio is the number of characters input divided by
the number of characters output. The program, called TXTCMP,
is interactive, being implemented in CPS (a subset of PL/I)
for operation from a TTY or IBM 2741 terminal. TXTCMP is fed
a line of text at a time and returns both line compression and
total compression since the start of the program. The program
listing and flow chart are reproduced in Appendix B.

The  experimental  material  was  chosen  from  two  different 
types  of  CAI  data  bases:  frame-structured  and  information- 
structured.  The  former  was  taken  from  the  Computer  Operator's 
course  of  reference  1,  the  latter  from  reference  6.  Both  are 
reproduced  in  Appendix  C.  The  lines  were  entered  exactly  as 
shown in Appendix C, spaces included, from the leftmost
character position as a reference, and the compressions were
obtained.  In  this  experiment  the  sets  chosen  by  Snyderman  and 
Hunt  for  master  characters  and  for  combining  characters  were 
used.  The  set  of  noncombining  characters  in  this  experiment 
was  everything  else  on  the  IBM  2741  keyboard  recognized  by 
CPS. 


In the Snyderman-Hunt application, 88 characters were
valid, leaving 168 for encoding character pairs. The Snyderman-Hunt
algorithm can be applied to compressing text in CPS
because CPS also uses, or admits in character strings, 88
characters, leaving 168 for encoding character pairs. These
results also apply to compressing text in the CODIT (Computer-Directed
Training) system because it is written into the Air
Force Phase II Base Level System via the Burroughs B3500 COBOL
language. COBOL uses 53 < 88 characters, leaving 203 > 168 characters
for encoding character pairs. Indeed, compression
might be slightly better when implemented in the B3500 environment,
because the 203 unused characters will accommodate 25
combining characters as opposed to only 21 in reference 5.
Alternatively, 9 rather than 8 master characters could be
accommodated, because the product of 9 master and 21 combining
characters is less than the 203 characters available.
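
The capacity argument above is simple integer arithmetic, sketched here in Python. The helper names are illustrative assumptions.

```python
TOTAL_CODES = 256  # distinct values of an 8-bit byte


def unused_codes(alphabet_size: int) -> int:
    """Byte values left over for encoding character pairs."""
    return TOTAL_CODES - alphabet_size


def max_combining(alphabet_size: int, masters: int) -> int:
    """Largest number of combining characters the unused codes can cover."""
    return unused_codes(alphabet_size) // masters


print(unused_codes(88), max_combining(88, 8))   # Snyderman-Hunt and CPS case
print(unused_codes(53), max_combining(53, 8))   # COBOL case with 8 masters
print(9 * 21, "<=", unused_codes(53))           # 9 masters with 21 combining
```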

2. Results


For the frame-structured material the average compression
was 1.473, with individual lines (except those with a single
space) ranging between 1.148 and 1.700. For the information-structured
material the average compression was 1.538,
with a low of 1.261 and a high of 1.875 for individual lines.

There  is  no  particular  accounting  for  the  slight  (4.4%) 
difference  in  average  compression,  because  the  spread  in 
individual  line  compression  is  quite  large  in  both  cases, 
with  considerable  overlap.  From  Figure  1  it  is  seen  that  the 
average  compression  settles  statistically  within  a  few  lines. 

B. Experiment Two
1.  Description 

The objective of experiment two is to obtain an estimate
of compression for the Snyderman-Hunt algorithm when applied to
the actual lesson file structure of CODIT. It is found that
the lesson file of CODIT contains both file structure specification
bytes, which are not compressible, and lesson text bytes,
which are. The file structure bytes occur according to Table 4.


Table 4

File Structure Bytes

Application      Number of Bytes

Frame Number     4 (per frame)
Frame Type       2 (per frame)
Frame Length     2 (per frame)
Group Number     1 (per group)
Group Length     2 (per group)
Line Number      4 (per line)
Line Length      3 (per line)


The program TXTCMP was modified (TXCP2) to add "overhead"
bytes to the compression calculation in the amount of 14 + 3 x
number of groups + 7 x number of lines each time a new frame
of CAI material was encountered. As an example, the CODIT printout
shown in Figure 1 of Appendix C contains three frames with
frame two containing three groups and six (numbered) lines.
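
The overhead rule can be written out as a small calculation. In the Python sketch below the per-frame constant 14 and the per-group and per-line weights are taken from the text as printed. Adding the overhead to both the input and the output character counts is an assumption about how the modified program folded it into the ratio, and the function names are hypothetical.

```python
def overhead_bytes(groups: int, lines: int,
                   per_frame: int = 14, per_group: int = 3,
                   per_line: int = 7) -> int:
    """Incompressible file structure bytes charged to one frame."""
    return per_frame + per_group * groups + per_line * lines


def file_compression(text_chars: int, stored_text_chars: int, frames) -> float:
    """Compression of the lesson file when overhead bytes are included.

    frames is a list of (groups, lines) tuples, one per frame. The overhead
    is assumed to appear unchanged in both the original and compressed file.
    """
    overhead = sum(overhead_bytes(g, l) for g, l in frames)
    return (text_chars + overhead) / (stored_text_chars + overhead)


# Frame two of the example: three groups and six numbered lines.
print(overhead_bytes(groups=3, lines=6))
```

Because the overhead is the same in the numerator and the denominator, it always pulls the file compression below the text-only compression.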


When  the  CODIT  CAI  material  was  entered,  only  the 
numbered  lines  were  entered  for  the  compression  calculation. 

It  will  be  recalled  that  in  experiment  one  all  lines  as  shown 
in  the  figure  were  entered.  The  line  numbers  and  the  two  spaces 
beyond  were  not  entered;  only  the  text  (course  author  generated) 
to  the  right  of  this  point  is  used.  This  is  because  all  other 
(formatting)  characters  can  be  accounted  for  by  the  CODIT 
master  program  reading  the  "overhead"  bytes  and  producing 
therefrom the non-text characters in the printout.


2.  Results 


The total CODIT subsystem compression for the material
in Figures 1 and 2 of Appendix C is 1.318. While this compression
is less than that obtained using all the characters
in Figures 1 and 2, it is a more realistic value because the
CODIT file structure "overhead" bytes are taken into account.
Also, it is a conservative (low) value because the frames in
the experimental set have very little expository text material.
The frames are largely for questioning the trainee rather than
for instructing him. One can reasonably expect an experimental
set containing a mix of questioning frames and instructing
frames to yield a higher compression. Even so, the 1.318 figure
has useful implications. In the CODIT subsystem it means
reducing each 121,600 byte lesson file by about 28,000 bytes
or, alternatively, putting 30% more lessons on disk for the
same CAI file allocation in the Air Force Phase II Base
Level System. Putting more lessons on-line gives increased
daily flexibility to the OJT/CAI program. Using less disk for
CAI increases the chances for its acceptance since it leaves
adequate disk space for the other functional areas, such as
personnel, finance and civil engineering.
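
The disk savings quoted above can be retraced with a short calculation. This Python sketch is illustrative only. The text's "about 28,000 bytes" and "30% more lessons" correspond to the rounded ratio of 1.3; using the measured 1.318 gives a slightly larger saving.

```python
LESSON_BYTES = 121_600  # stated size of one 300-frame CODIT lesson


def bytes_saved(compression: float) -> float:
    """Bytes removed from one lesson file at the given compression ratio."""
    return LESSON_BYTES - LESSON_BYTES / compression


def extra_capacity(compression: float) -> float:
    """Fractional increase in lessons that fit in the same disk allocation."""
    return compression - 1.0


for ratio in (1.3, 1.318):
    print(f"ratio {ratio}: saves {bytes_saved(ratio):,.0f} bytes, "
          f"{extra_capacity(ratio):.0%} more lessons")
```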

C.  Experiment  Three 

1.  Description 

The objective of experiment three is to verify the analytical
model of the Snyderman-Hunt algorithm developed in Appendix A.
The essence of the model is equation (7), Appendix A, which predicts
compression on the basis of $p_1$, the probability of a master
character occurring, and $p_1 p_2$, the joint probability of both a
master and a combining character occurring together. Should
the model be verified to an engineering degree of accuracy, it
would then be possible to select optimum master
and combining character sets more easily, because $p_1$ is simply related to
single letter and space relative occurrences in English and
$p_1 p_2$ is also simply related to double letter and space relative
occurrences. When TXTCMP was developed into TXTCP2, provision
was made to measure $p_1$, $p_2$ and $p_1 p_2$ on the text portion of
the experimental material, and from these the theoretical, or predicted, text
compression was calculated. The experimental material used
was the text portion (1003 characters) of Figures 1 and 2 of
Appendix C.
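
One plausible way to take these occurrence statistics from a piece of text is sketched below in Python. The exact counting convention used by the modified program is not given in this section, so the definitions here are assumptions: $p_1$ is taken as the fraction of characters that are master characters, $p_2$ as the fraction of master characters that are immediately followed by a combining character, and the joint value as the fraction of positions where both hold. The model equation of Appendix A is not reproduced here.

```python
MC = set(" AEIONTU")                # master characters
CC = set(" ABCDEFGHILMNOPRSTUVW")   # combining characters


def occurrence_statistics(text: str):
    """Return (p1, p2, joint) for upper-case text under the assumed definitions."""
    n = len(text)
    if n == 0:
        return 0.0, 0.0, 0.0
    # Positions holding a master character.
    masters = [i for i, ch in enumerate(text) if ch in MC]
    # Master positions whose next character is a combining character.
    pairs = [i for i in masters if i + 1 < n and text[i + 1] in CC]
    p1 = len(masters) / n
    p2 = len(pairs) / len(masters) if masters else 0.0
    joint = len(pairs) / n
    return p1, p2, joint


p1, p2, joint = occurrence_statistics("THE CAT")
print(f"p1 = {p1:.3f}, p2 = {p2:.3f}, p1*p2 = {joint:.3f}")
```

With these definitions the joint value equals the product of the other two by construction, which is convenient when comparing measured statistics against tabulated single-letter and letter-pair frequencies.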

2.  Results 

Using the text material only, i.e., with no CODIT subsystem
"overhead" bytes considered, it is found that the material of
Figures 1 and 2, Appendix C, yields $p_1 = .566$, $p_1 p_2 = .531$ and
a theoretical compression of 1.513. This value compares quite
well (within 2%) with the experimentally measured text
compression, 1.530. Furthermore, examination of cumulative
measured text compression and cumulative theoretical text
compression as they build up on a line-by-line basis shows that
the compression predicted by equation (7) of Appendix A is stable
and always within 2.5 per cent of the measured value, thus
indicating a valid model for the Snyderman-Hunt algorithm.
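
The predicted value can be reproduced directly from equation (7) of Appendix A. The sketch below is a minimal check, reading the two measured quantities quoted above as $p_1 = .566$ and $p_1 p_2 = .531$; the function and constant names are illustrative and do not come from TXTCP2.

```python
def expected_compression(p1: float, p1p2: float) -> float:
    """Equation (7) of Appendix A: C = (1 + p1) / (1 + p1 - p1*p2)."""
    return (1.0 + p1) / (1.0 + p1 - p1p2)

P1 = 0.566        # probability of a master character (quoted in the text)
P1P2 = 0.531      # joint probability, master then combining (quoted in the text)
MEASURED = 1.530  # experimentally measured text compression (quoted in the text)

predicted = expected_compression(P1, P1P2)
error_pct = abs(MEASURED - predicted) / MEASURED * 100.0

print(f"predicted {predicted:.3f}, measured {MEASURED:.3f}, "
      f"difference {error_pct:.1f} %")
```

The difference between predicted and measured compression is a little over one per cent, consistent with the "within 2%" statement.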

SECTION VI


CONCLUSIONS,  QUESTIONS  AND  RECOMMENDATIONS 


Based  on  these  results,  three  major  conclusions  follow: 

1.  A  working  figure  of  1.5  may  be  taken  for  the  practical 
compression  of  CAI  text  material. 

2.  When  frame  formatting  overhead  bytes  are  taken  into 
account  in  a  typical  CAI  system,  the  compression  figure  becomes, 
conservatively,  1.3  to  one. 

3.  It  is  possible  to  adequately  model  the  Snyderman-Hunt 
algorithm  and  predict  compression  performance  within  a  few  per 
cent,  based  on  text  statistics. 

Given  these  conclusions  several  timely  questions  may  be 
raised: 

How can the Snyderman-Hunt algorithm be optimally applied
to CODIT, which is now being implemented Air Force-wide? Where
would the compression and decompression algorithm be inserted
into the CODIT system flow diagram (pg. 50 of reference 1)?

Can a B3500 assembly language compression/decompression
algorithm be patched into a compiled COBOL CODIT program? Given
that COBOL uses only 53 characters, what are now the optimum
master and combining character sets? What is the dollar saving
in reduced disk files and magnetic tapes? By how much is this
dollar saving offset by the 75-odd microsecond per character
CPU time cost?
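
To give a sense of scale for the CPU time question, the sketch below multiplies the 75 microsecond per character figure by the 121,600 byte lesson file size quoted earlier in this report. It assumes, purely for illustration, that every byte of a lesson passes through the routine once and that one byte is one character; neither assumption is established by this study, and no dollar figures are implied.

```python
CPU_SECONDS_PER_CHAR = 75e-6   # "75-odd microsecond per character" from the text
LESSON_BYTES = 121_600         # lesson file size quoted earlier in the report

def cpu_seconds(n_chars: int, per_char: float = CPU_SECONDS_PER_CHAR) -> float:
    """CPU time for one pass over n_chars characters (illustrative model)."""
    return n_chars * per_char

lesson_cpu = cpu_seconds(LESSON_BYTES)
print(f"{lesson_cpu:.2f} CPU seconds for one pass over a full lesson file")
```

A rough in-house estimate would weigh this time against the disk and tape space saved, using actual CAI character throughput rather than whole-file passes.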

The  dollar  saving  questions  can  be  approached  in  two  ways: 

1.  By  taking  gross  costs  from  the  current  B3500  Base  Level 
System  installation  with  estimates  of  CAI  file  space,  CAI 
character throughput, and B3500 speed for compressing and
decompressing, it is possible, in-house, to arrive at a rough
estimate  of  dollar  saving. 

2.  By  putting  this  problem  to  industry  as  a  contracted  study 
wherein  the  contractor  designs  an  optimal  compression  system 
based  on  extensive  CAI  data  base  material,  does  a  preliminary 
system design around current or projected hardware, and
calculates relative costs of going compressed and uncompressed
within  the  system. 

It is recommended that (1) above be accomplished and,
based on the outcome, that (2) be considered, perhaps as part of
contract definition for Air Force systems beyond B3500. It
is also recommended that text compression be considered if
CODIT is rewritten in JOVIAL for DAFCCS application. Finally,
it is recommended that the Snyderman-Hunt algorithm be
experimentally applied to other Air Force textual data bases,
such as intelligence.


APPENDIX A


Analysis  of  the  Snyderman-Hunt 
Text  Compression  Algorithm 


Consider a string of N characters. As a character is
examined to see if it is a master character, there is the
possibility that either one or two characters will be read in.
Let $p_1$ be the probability that the character examined is a
master character and $1 - p_1$ the probability it is not. If the
character is a master character, then a second character will
be read in; if it is not, then only the single character is
read in, and the cycle is repeated. The expected number of
characters input per cycle, ECI, is given by

$$\mathrm{ECI} = 2\,p_1 + 1\,(1 - p_1) = 1 + p_1. \tag{1}$$

For  a  string  of  N  characters  the  number  of  read  cycles  R  is 
given  by 


$$R = \frac{N}{1 + p_1}. \tag{2}$$


When a master character is found, with probability $p_1$,
two possibilities exist: the next character will be a combining
character, or it will not. Let $p_2$ be the probability that the
next character will be a combining character and $1 - p_2$ that
it will not. If the second character is a combining character,
it will be combined with the master character and only one
character will be read out. If the second character is not a
combining character, then two characters will be read out. If
the first character is not a master character, only one character
will be read out. These rules lead to the expected number of
characters output per cycle, ECO, being given by

$$\mathrm{ECO} = 1\,(p_1 p_2) + 2\,\bigl(p_1 (1 - p_2)\bigr) + 1\,(1 - p_1) = 1 + p_1 - p_1 p_2. \tag{3}$$

The expected number of characters read out, NO, per line
of N characters in, is given by

$$\mathrm{NO} = R\,(\mathrm{ECO}) \tag{4}$$

$$\mathrm{NO} = R\,(1 + p_1 - p_1 p_2). \tag{5}$$

Compression C is defined as the number of characters
N in the line divided by the number of characters NO read out
from the line processing, i.e.

$$C = \frac{N}{\mathrm{NO}}. \tag{6}$$

Substituting $N = R\,(1 + p_1)$ from equation (2) and NO from
equation (5) in the above, we relate expected compression to the
probabilities $p_1$ and $p_2$:

$$C = \frac{R\,(1 + p_1)}{R\,(1 + p_1 - p_1 p_2)} = \frac{1 + p_1}{1 + p_1 - p_1 p_2}. \tag{7}$$

Note that if all first read characters are master
characters, $p_1 = 1$, and if all second read characters are
combining characters, $p_2 = 1$, then C is a maximum and equal to 2.
On the other hand, if no master characters occur, $p_1 = 0$, then
compression is at a minimum and equal to unity. Since $p_1$ is
the probability p(MC) of finding a master character and $p_2$ is
the probability p(CC/MC) of finding a combining character,
given a master character, we see that $p_1 p_2$ is the joint
probability p(MC,CC) of finding a master character and a combining
character together. Both p(MC) and p(MC,CC) can be experimentally
determined for a given data base, such as English, once a table
of first and second order occurrences is compiled and the sets
of master characters and combining characters are defined. The
sets can be adjusted, within the constraints given in the text,
to maximize the expected compression.
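
One way to estimate p(MC) and p(MC,CC) from first and second order occurrence tables is sketched below in Python. This is an illustration under stated assumptions, not the procedure used in TXTCP2: the character sets are those read from the TXTCMP listing in Appendix B, the function names are hypothetical, and overlapping character pairs are counted, whereas the algorithm itself consumes pairs without overlap. The estimate is therefore approximate.

```python
from collections import Counter

# Master and combining sets as read from the TXTCMP listing (assumed here).
MASTERS = set(" AEIONTU")
COMBINERS = set(" ABCDEFGHILMNOPRSTUVW")

def estimate_probabilities(text, masters=MASTERS, combiners=COMBINERS):
    """Return (p_mc, p_mc_cc) from single- and double-character tables."""
    if not text:
        raise ValueError("text must not be empty")
    n = len(text)
    singles = Counter(text)                 # first order occurrences
    pairs = Counter(zip(text, text[1:]))    # second order occurrences
    master_count = sum(c for ch, c in singles.items() if ch in masters)
    pair_count = sum(c for (a, b), c in pairs.items()
                     if a in masters and b in combiners)
    # Both counts are divided by the number of characters, so that
    # p_mc_cc can never exceed p_mc.
    return master_count / n, pair_count / n

def expected_compression(p_mc, p_mc_cc):
    """Equation (7): C = (1 + p1) / (1 + p1 - p1*p2)."""
    return (1.0 + p_mc) / (1.0 + p_mc - p_mc_cc)

sample = "WHAT DOES COBOL STAND FOR?"
p_mc, p_mc_cc = estimate_probabilities(sample)
print(round(p_mc, 3), round(p_mc_cc, 3),
      round(expected_compression(p_mc, p_mc_cc), 3))
```

Trial master and combining sets can be passed as the `masters` and `combiners` arguments to compare their expected compression on the same text.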

APPENDIX B


TXTCMP  Program  Listing 


Note 1: The program operates by working its way (via POINT)
through the LINE of text, character by character. If a master
character is not found, both the compressed and uncompressed
bit counts are augmented by one byte; if a master character is
found, the next character is tested for being a combining
character. If the next character is a combining character, the
compressed bit count is augmented by one byte and the
uncompressed bit count by two bytes; otherwise both compressed and
uncompressed bit counts are augmented by two bytes. An isolated
master character at the end of LINE will be so identified
(program line 350) and cause augmentation of both compressed and
uncompressed bit counts by one byte. Success of the end of line
test initiates printout.
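
The counting rules of Note 1 can be sketched in Python as follows. This is an illustrative re-expression, not a translation of the CPS listing: the function names are hypothetical, the character sets are those read from the listing, and a structured loop replaces the GO TO flow of the original.

```python
MASTERS = " AEIONTU"                  # M(1)..M(8) as read from the listing
COMBINERS = " ABCDEFGHILMNOPRSTUVW"   # CC(1)..CC(21) as read from the listing

def count_bits(line):
    """Return (uncompressed_bits, compressed_bits) for one line of text."""
    uncompressed = compressed = 0
    point = 0
    while point < len(line):
        first = line[point]
        uncompressed += 8
        point += 1
        if first in MASTERS and point < len(line):
            second = line[point]
            uncompressed += 8
            point += 1
            # A master followed by a combining character packs into one
            # byte; otherwise both characters are written out.
            compressed += 8 if second in COMBINERS else 16
        else:
            # Non-master character, or an isolated master at end of line.
            compressed += 8
    return uncompressed, compressed

def line_compression(line):
    """Uncompressed bits divided by compressed bits for one line."""
    if not line:
        raise ValueError("line must not be empty")
    uc, c = count_bits(line)
    return uc / c

print(count_bits("WHAT DOES COBOL STAND FOR?"))
```

For the sample line, taken from Figure C-1, the 26 input characters reduce to 15 output bytes.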

Note  2:  Program  line  426  is  not  essential  to  operation; 
it  merely  prints  the  value  of  POINT  occasionally  to  let  you 
know  the  program  is  functioning  during  the  wait  between  line 
input  and  compression  printout. 

Note 3: Variable listing

Variable    Explanation

M(I) ...... Master character array
CC(I) ..... Combining character array
TUC ....... Number of bits, uncompressed, from beginning of program
TC ........ Number of bits, compressed, from beginning of program
LINE ...... Character variable containing a line of text
UC ........ Number of bits, uncompressed, in a given line
C ......... Number of bits, compressed, in a given line
POINT ..... A text pointer variable
TEST1 ..... A character variable containing one character being
            tested to see if it is a master character
I ......... A general indexing variable
TEST2 ..... A character variable containing one character being
            tested to see if it is a combining character
TOTCMP .... Total compression since beginning of program
LNECMP .... Compression of the given line above

Note 4: Label listing

Label       Explanation

TXTCMP .... The name of the program: "Text Compression"
LNEGET .... Get a new line of text
CHRGET .... Get a new character from the line
NXTCHR .... Get the next character (following an identified
            master character)
AUGMT2 .... Augment the bit count by 2 bytes (16 bits)
AUGMT1 .... Augment the bit count by 1 byte (8 bits)
EOLTST .... End of line test
EOL ....... End of line


FIGURE B-1 FLOW CHART FOR TXTCMP

  5.  TXTCMP: PROCEDURE;
 10.  /*THIS PROGRAM SIMULATES THE TEXT COMPACTION ALGORITHM OF*/;
 15.  /*SNYDERMAN AND HUNT, DATAMATION, DEC 1, 1970.*/;
 20.  PUT LIST(' ');
 25.  PUT LIST('EXECUTING TEXT COMPRESSION.');
 30.  PUT LIST('PLEASE NOTE: ENTER ALL CHARACTERS IN UPPERCASE.');
 35.  PUT LIST('ALSO NOTE: LIMIT LINE TO 70 CHARACTERS.');
 40.  PUT LIST('HIT ATTN ON IBM 2741 TERMINAL OR BREAK ON TTY TO END PROGRAM.');
 45.  PUT LIST(' ');
 50.  DECLARE M(8) CHAR(2), LINE CHAR(70) VAR;
 55.  DECLARE CC(21) CHAR(1);
 60.  DECLARE TEST1 CHAR(1), TEST2 CHAR(1);
 65.  M(1)=' ';
 70.  M(2)='A';
 75.  M(3)='E';
 80.  M(4)='I';
 85.  M(5)='O';
 90.  M(6)='N';
 95.  M(7)='T';
100.  M(8)='U';
105.  CC(1)=' ';
110.  CC(2)='A';
115.  CC(3)='B';
120.  CC(4)='C';
125.  CC(5)='D';
130.  CC(6)='E';
135.  CC(7)='F';
140.  CC(8)='G';
145.  CC(9)='H';
150.  CC(10)='I';
155.  CC(11)='L';
160.  CC(12)='M';
165.  CC(13)='N';
170.  CC(14)='O';
175.  CC(15)='P';
180.  CC(16)='R';
185.  CC(17)='S';
190.  CC(18)='T';
195.  CC(19)='U';
200.  CC(20)='V';
205.  CC(21)='W';
280.  TUC=0;
285.  TC=0;
290.  LNEGET: PUT LIST('LINE');
295.  READ INTO(LINE);
300.  UC=0;
305.  C=0;
310.  POINT=1;
315.  CHRGET: TEST1=SUBSTR(LINE,POINT,1);
320.  TUC=TUC+8;
325.  UC=UC+8;
335.  DO I=1 TO 8;
340.  IF TEST1=M(I) THEN GO TO NXTCHR;
345.  END;
348.  GO TO AUGMT1;
350.  NXTCHR: IF POINT=LENGTH(LINE) THEN GO TO AUGMT1;
355.  POINT=POINT+1;
360.  TEST2=SUBSTR(LINE,POINT,1);
365.  TUC=TUC+8;
370.  UC=UC+8;
380.  DO I=1 TO 21;
385.  IF TEST2=CC(I) THEN GO TO AUGMT1;
390.  END;
395.  AUGMT2: C=C+16;
400.  TC=TC+16;
405.  GO TO EOLTST;
410.  AUGMT1: C=C+8;
415.  TC=TC+8;
420.  EOLTST: IF POINT=LENGTH(LINE) THEN GO TO EOL;
425.  POINT=POINT+1;
426.  IF POINT/6=TRUNC(POINT/6) THEN PUT LIST(POINT);
430.  GO TO CHRGET;
435.  EOL: TOTCMP=TUC/TC;
440.  LNECMP=UC/C;
445.  PUT LIST(' ');
450.  PUT LIST('LINE COMPRESSION');
455.  PUT LIST(LNECMP);
460.  PUT LIST('TOTAL COMPRESSION');
465.  PUT LIST(TOTCMP);
470.  PUT LIST(' ');
475.  GO TO LNEGET;
480.  END TXTCMP;


FIGURE B-2 CPS LISTING OF PROGRAM
TXTCMP: "TEXT COMPRESSION"


APPENDIX C


EXPERIMENTAL  MATERIAL 


Reproduction of frame-structured and information-structured
CAI material. All parts of all lines containing one or more
characters constitute the experimental set for experiment one.
Only the text portions of numbered lines in Figures C-1 and C-2
constitute the experimental set for experiments 2 and 3.

LESSON 0007000  DATE WRITTEN 160569  PAGE 1

FRAME 1.0  TYPE M1  LABEL 000700
G.2 TEXT
1.0 VIT PROGRAMMING LANGUAGES??
2.0 DO YOU WANT TO TRY THE LESSON ON PROGRAMMING LANGUAGES
3.0 OR DO YOU THINK YOU CAN SKIP IT?
G.3 ANSWERS
1.0 A+I WILL TRY THE LESSON ON PROGRAMMING LANGUAGES.
2.0 B+I THINK I KNOW ENOUGH TO SKIP IT.
G.4 ACTIONS
1.0 A F:FINE. LET'S BEGIN. 8:31
2.0 B F:WE'LL GIVE YOU A LITTLE TEST JUST TO MAKE SURE.

FRAME 2.0  TYPE Q1  LABEL
G.2 TEXT
1.0 WHAT DOES COBOL STAND FOR?
G.3 ANSWERS
1.0 0 SET KEWORD ON
2.0 0 SET PHONETIC ON
3.0 0 SET ORDER ON
4.0 A+COMMON BUSINESS ORIENTED LANGUAGE

FRAME 3.0  TYPE Q1  LABEL
G.2 TEXT
1.0 WHAT DOES FORTRAN STAND FOR?
G.3 ANSWERS
1.0 A+FORMULA TRANSLATION


FIGURE C-1 CODIT SUBSYSTEM FRAME
STRUCTURED CAI MATERIAL

LESSON 000700  DATE WRITTEN 160569  PAGE 2

FRAME 4.0  TYPE Q1  LABEL
G.2 TEXT
1.0 WHAT DOES RPG STAND FOR?
G.3 ANSWERS
1.0 A+REPORT PROGRAM GENERATOR

FRAME 5.0  TYPE M1  LABEL
G.2 TEXT
1.0 WHAT IS MADE UP OF 1'S AND 0'S?
G.3 ANSWERS
1.0 A+MACHINE LANGUAGE
2.0 B PROCEDURE-ORIENTED LANGUAGE
3.0 C RPG LANGUAGE
4.0 D OCTAL LANGUAGE
5.0 E NONE OF THE ABOVE

FRAME 6.0  TYPE Q1  LABEL
G.2 TEXT
1.0 WHAT DO YOU CALL MACHINE-SPECIFIC INSTRUCTIONS USED BY A
2.0 PROGRAMMER SPECIALIST TO REPRESENT EACH MACHINE OPERATION?
3.0 (THE WORD 'MACHINE' SHOULD NOT BE INCLUDED)
G.3 ANSWERS
1.0 A+MNEMONIC
2.0 B+SYMBOLIC
3.0 C+SYMBOLIC CODE

FRAME 7.0  TYPE D1  LABEL
G.2 CONDITIONS
1.0 IF GQ 2 WRONG 2-6 F: YOU'RE OFF TO A BAD START. YOU'D BETTER
2.0 F:TRY THE LESSON. B:M0D7


FIGURE C-2 CODIT SUBSYSTEM FRAME
STRUCTURED CAI MATERIAL

(RPAQQ LATITUDE (((ON LATITUDE)
(DET THE DEF 2))
NIL
(SUPERC NIL (DISTANCE NIL ANGULAR (FROM NIL
EQUATOR)))
(SUPERP (I 2)
LOCATION)
(VALUE (I 2)
(RANGE NIL -90 90))
(UNIT (I 2)
DEGREES)))

(RPAQQ ARGENTINA (((XN ARGENTINA)
(DET NIL DEF 2))
NIL
(SUPERC NIL COUNTRY)
(SUPERP (I 6)
SOUTH/AMERICA)
AREA (I 2)
(APPROX NIL/120000000
(LOCATION NIL SOUTH/AMERICA (LATITUDE (I 2)
(RANGE NIL -22 -55))
(LONGITUDE (I 4)
(RANGE NIL -57 -71))
(BORDERING/COUNTRIES (I 1)
(NORTHERN (I 1)
BOLIVIA PARAGUAY)
(EASTERN (I 1)
(($L BRAZIL URUGUAY
NIL
(BOUNDARY NIL URUGUAY/RIVER)))
(CAPITAL (I 1)
BUENOS/AIRES)
(CITIES (I 3)
(PRINCIPAL NIL ($L BUENOS/AIRES CORDOBA ROSARIO
MENDOZA LA/PLATA TUCUMAN)))
(TOPOGRAPHY (I 1)
VARIED
(MOUNTAIN/CHAINS NIL (PRINCIPAL NIL ANDES
(LOCATION NIL (BOUNDARY NIL (WITH NIL
CHILE)))
(ALTITUDE NIL (HIGHEST NIL ACONCAGUA
(APPROX NIL 22000))))
(SIERRAS NIL (LOCATION NIL ($L CORDOBA
BUENOS/AIRES))))
(PLAINS NIL (FERTILE NIL USUALLY)
(($L EASTERN CENTRAL)
NIL PAMPA)
(NORTHERN NIL CHACO)))


FIGURE C-3 THE UNITS FOR LATITUDE AND ARGENTINA (FRAGMENTS) IN SCHOLAR,
AN INFORMATION STRUCTURED CAI SYSTEM